Simple Linear Regression

In this notebook I use data on house sales in King County to predict house prices using simple (one-input) linear regression.

Firing up Turi Create

Load house sales data

The dataset comes from house sales in King County, the region where the city of Seattle, WA is located.

Split data into training and testing

Useful SFrame summary functions

To use the closed-form solution and take advantage of Turi Create's built-in functions, I will first review some important ones, starting with two ways of computing the mean of a column.

As we can see, we get the same answer both ways.
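The comparison above can be sketched in plain Python. This is only an illustration: a list stands in for the price SArray, the four prices are made-up values, and `statistics.mean` plays the role of a built-in mean method.

```python
import math
import statistics

# Made-up prices; a plain list stands in for the price column (an SArray).
prices = [221900, 538000, 180000, 604000]

# Method 1: divide the sum by the number of observations.
mean_from_sum = sum(prices) / len(prices)

# Method 2: use a built-in mean function.
mean_builtin = statistics.mean(prices)

print(mean_from_sum, mean_builtin)
assert math.isclose(mean_from_sum, mean_builtin)
```

With an SArray the same idea would presumably use its own sum, size, and mean methods instead of the Python built-ins.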

Building a generic simple linear regression function

Armed with these SArray functions, I can now use the closed-form solution to compute the slope and intercept for a simple linear regression on observations stored as two SArrays: input_feature and output.
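For reference, the standard closed-form least-squares solution can be written as follows, with intercept a and slope b as in y = a + b*x, N observations, and x_i, y_i denoting the input feature and output values (this notation is my own shorthand, not taken from the notebook's code):

$$
b = \frac{\sum_{i=1}^{N} x_i y_i - \frac{1}{N}\left(\sum_{i=1}^{N} x_i\right)\left(\sum_{i=1}^{N} y_i\right)}{\sum_{i=1}^{N} x_i^2 - \frac{1}{N}\left(\sum_{i=1}^{N} x_i\right)^2},
\qquad
a = \frac{1}{N}\sum_{i=1}^{N} y_i - b \cdot \frac{1}{N}\sum_{i=1}^{N} x_i
$$

Every term is a sum or a mean, which is why the summary functions reviewed earlier are all that is needed.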

I can test that my function works by passing it data for which I already know the answer. In particular, I can generate a feature and then put the output exactly on a line: output = 1 + 1*input_feature. In that case both the slope and the intercept should be 1.
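A minimal sketch of such a function and its sanity check follows. It assumes plain Python sequences in place of SArrays, and the name `simple_linear_regression` and its signature are my own choices for illustration.

```python
def simple_linear_regression(input_feature, output):
    """Return (intercept, slope) of the least-squares line through the data."""
    n = len(input_feature)
    sum_x = sum(input_feature)
    sum_y = sum(output)
    sum_xy = sum(x * y for x, y in zip(input_feature, output))
    sum_xx = sum(x * x for x in input_feature)

    # Closed-form least-squares estimates built only from sums.
    slope = (sum_xy - sum_x * sum_y / n) / (sum_xx - sum_x * sum_x / n)
    intercept = sum_y / n - slope * sum_x / n
    return intercept, slope


# Sanity check: points placed exactly on output = 1 + 1 * input_feature.
test_feature = [0, 1, 2, 3, 4]
test_output = [1 + 1 * x for x in test_feature]
test_intercept, test_slope = simple_linear_regression(test_feature, test_output)
print("Intercept:", test_intercept)
print("Slope:", test_slope)
```

Note that the sketch divides by zero if every input value is identical, since no unique line fits such data.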

Now that I know it works, I will build a regression model for predicting price based on sqft_living.

Predicting Values

Now that I have the model parameters (intercept and slope), I can make predictions. With SArrays it is easy to multiply an SArray by a constant and add a constant value.
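The prediction step can be sketched as below. The function name `get_regression_predictions` is an assumption, a list comprehension stands in for the element-wise SArray arithmetic, and the parameter values are hypothetical rather than fitted.

```python
def get_regression_predictions(input_feature, intercept, slope):
    """Predict intercept + slope * x for every value x in input_feature."""
    return [intercept + slope * x for x in input_feature]


# Hypothetical parameters, chosen only to make the arithmetic easy to follow.
example_predictions = get_regression_predictions(
    [0.0, 1.0, 2.0], intercept=1.0, slope=2.0
)
print(example_predictions)
```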

Now that I can calculate a prediction given the slope and intercept, I will make my first prediction: the estimated price for a house with 2,650 square feet, according to the square feet model I estimated above.

Residual Sum of Squares

Now that I have a model and can make predictions, I will evaluate the model using the Residual Sum of Squares (RSS). RSS is the sum of the squared residuals, where a residual is simply the difference between the predicted output and the true output.

I will now test my get_residual_sum_of_squares function by applying it to the test model, where the data lie exactly on a line. Since the points lie exactly on a line, the residual sum of squares should be zero.
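One way this could look is sketched below. The function name comes from the text above, but the signature (inputs, outputs, intercept, slope) is my assumption, and plain lists again stand in for SArrays.

```python
def get_residual_sum_of_squares(input_feature, output, intercept, slope):
    """Sum of squared differences between true and predicted outputs."""
    predictions = [intercept + slope * x for x in input_feature]
    residuals = [y - y_hat for y, y_hat in zip(output, predictions)]
    return sum(r * r for r in residuals)


# Points placed exactly on output = 1 + 1 * input_feature give zero residuals.
line_feature = [0, 1, 2, 3, 4]
line_output = [1 + 1 * x for x in line_feature]
rss_on_line = get_residual_sum_of_squares(
    line_feature, line_output, intercept=1, slope=1
)
print(rss_on_line)
```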

Now I will use my function to calculate the RSS on the training data for the square feet model fitted above.

Predicting the square feet given price

What if I want to predict the square feet given the price? Since I have the equation y = a + b*x, I can solve it for x: x = (y - a) / b. So if I have the intercept (a), the slope (b), and the price (y), I can solve for the estimated square feet (x).
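The inversion can be sketched as a small function. The name `inverse_regression_predictions` and the zero-slope guard are my additions, and the example values are hypothetical rather than fitted.

```python
def inverse_regression_predictions(output, intercept, slope):
    """Solve output = intercept + slope * x for x."""
    if slope == 0:
        # A flat line maps every input to the same output, so it has no inverse.
        raise ValueError("slope must be non-zero to invert the line")
    return (output - intercept) / slope


# Hypothetical parameters, chosen only to make the arithmetic easy to follow.
example_input = inverse_regression_predictions(5.0, intercept=1.0, slope=2.0)
print(example_input)
```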

Now that I have a function to compute the square feet given the price from my simple regression model, let's see how big I might expect a house that costs $800,000 to be.

New Model: estimate prices from bedrooms

I have made one model for predicting house prices using square feet, but there are many other features in the sales SFrame. I will now use my simple linear regression function to estimate the regression parameters for predicting price based on the number of bedrooms.

Testing my linear regression algorithm

Now I have two models for predicting the price of a house. How do I know which one is better? I will calculate the RSS on the TEST data for each of those models.
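The comparison procedure can be sketched as follows. Everything here is illustrative: the three-row test set is toy data rather than the King County data, the (intercept, slope) pairs are hypothetical stand-ins for the fitted models, and `rss` is a compact helper I introduce for the sketch.

```python
def rss(inputs, outputs, intercept, slope):
    """Residual sum of squares of the line intercept + slope * x."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(inputs, outputs))


# Toy test set (made-up values, not the King County data).
test_data = {
    "sqft_living": [1000, 2000, 3000],
    "bedrooms": [2, 3, 3],
    "price": [200000, 400000, 600000],
}

# Hypothetical (intercept, slope) pairs standing in for the two fitted models.
models = {
    "sqft_living": (10000.0, 195.0),
    "bedrooms": (0.0, 150000.0),
}

# Evaluate each model on the same held-out prices.
test_rss = {
    feature: rss(test_data[feature], test_data["price"], intercept, slope)
    for feature, (intercept, slope) in models.items()
}

# The model with the lower test RSS fits the held-out data better.
best_feature = min(test_rss, key=test_rss.get)
print(test_rss, best_feature)
```

Both models are scored against the same test prices, so their RSS values are directly comparable.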

The RSS is lower for the model based on square feet. On this test data, the square footage of a house is therefore a better single predictor of its price than the number of bedrooms.